Castleman Disease (CD) is a heterogeneous group of rare lymphoproliferative disorders. Diagnosis requires histopathologic interpretation of lymph node biopsies, in which five key histologic features (atretic germinal centers, follicular dendritic cell prominence, vascularity, hyperplastic germinal centers, and plasmacytosis) are graded on an ordinal scale from 0 to 3. However, this grading is subjective, with substantial interobserver variability among pathologists. We evaluated whether artificial intelligence (AI)-based computational pathology, specifically attention-based multiple instance learning (ABMIL), could automate CD grading with reliability and accuracy comparable to those of hematopathology experts.

We developed a proof-of-concept ABMIL model to predict slide-level CD histology scores from whole-slide images (WSIs) of H&E-stained lymph node tissue. Each WSI was divided into tiles, and a pre-trained foundation model (Virchow2) was used to extract embeddings (numerical representations of the image features). The ABMIL model aggregated these embeddings into slide-level predictions for the five established histologic features plus a sixth, follicular twinning. While not part of the formal CD grading criteria, follicular twinning is a recurrent morphologic finding of interest and was included to evaluate whether the model could detect and score biologically relevant but non-canonical features.
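To make the aggregation step concrete, the sketch below shows a gated attention-based MIL head in the style of Ilse et al. (2018), operating on pre-extracted tile embeddings. The embedding dimension, hidden size, and per-feature ordinal heads are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal gated-ABMIL sketch: tile embeddings -> slide-level ordinal grades.
# Assumes embeddings were extracted offline with a foundation model such as
# Virchow2; the 2560-dim embedding and 4-way (grades 0-3) heads are assumptions.
import torch
import torch.nn as nn

class ABMILScorer(nn.Module):
    def __init__(self, embed_dim=2560, hidden_dim=256, n_features=6, n_grades=4):
        super().__init__()
        # Gated attention: tanh and sigmoid branches combined multiplicatively.
        self.attn_V = nn.Sequential(nn.Linear(embed_dim, hidden_dim), nn.Tanh())
        self.attn_U = nn.Sequential(nn.Linear(embed_dim, hidden_dim), nn.Sigmoid())
        self.attn_w = nn.Linear(hidden_dim, 1)
        # One classifier per histologic feature (five canonical + twinning).
        self.heads = nn.ModuleList(
            [nn.Linear(embed_dim, n_grades) for _ in range(n_features)]
        )

    def forward(self, tiles):  # tiles: (n_tiles, embed_dim) for one WSI
        a = self.attn_w(self.attn_V(tiles) * self.attn_U(tiles))  # (n_tiles, 1)
        a = torch.softmax(a, dim=0)        # attention weights over tiles
        slide = (a * tiles).sum(dim=0)     # attention-weighted slide embedding
        logits = torch.stack([h(slide) for h in self.heads])  # (n_features, n_grades)
        return logits, a.squeeze(-1)       # grades + per-tile attention for heatmaps

model = ABMILScorer()
tiles = torch.randn(1200, 2560)            # embeddings for one slide's tiles
logits, attention = model(tiles)
grades = logits.argmax(dim=-1)             # predicted 0-3 grade per feature
```

Returning the per-tile attention weights alongside the grades is what enables the attention-heatmap interpretability analysis described below.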

Our dataset consisted of 154 WSIs featuring CD or CD-like histology, each annotated for every feature by a panel of eight hematopathologists. Model training and validation used only these slide-level grades.

To evaluate model performance and interpretability, we compared model predictions to expert consensus among the eight hematopathologists using Krippendorff's alpha, and benchmarked them against the range of interobserver agreement among expert hematopathologists who were not CD specialists. Leave-one-out analysis of the graders confirmed substantial inter-rater variability. Model predictions showed moderate concordance with the expert “ground-truth” annotations (Krippendorff's α = 0.59), falling within the range of inter-rater variability observed among the hematopathologists themselves (mean α = 0.51, SD = 0.22); in other words, the model's disagreement with the panel was typically no greater than the average disagreement between individual experts. Visualizations of tile-level attention weights confirmed that the model attends to diagnostically relevant regions while ignoring irrelevant ones, supporting biological interpretability despite training with only weak supervision.
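A sketch of this agreement analysis, using the open-source `krippendorff` package with ordinal-level weighting, is shown below. The rating matrix is synthetic, and treating the leave-one-out analysis as alpha among the remaining experts is one plausible reading of the method, not the study's exact protocol.

```python
# Illustrative agreement analysis with Krippendorff's alpha (ordinal).
# Requires: pip install krippendorff numpy. All grades below are synthetic.
import numpy as np
import krippendorff

rng = np.random.default_rng(0)
# reliability_data: rows = raters, cols = slides, values = ordinal grades 0-3
# (np.nan would mark missing ratings). 8 experts x 154 slides, per the study.
pathologists = rng.integers(0, 4, size=(8, 154)).astype(float)
model_preds = rng.integers(0, 4, size=(1, 154)).astype(float)

# Concordance of the model with the expert panel.
alpha_model = krippendorff.alpha(
    reliability_data=np.vstack([pathologists, model_preds]),
    level_of_measurement="ordinal",
)

# Leave-one-out: alpha among the remaining experts when each is held out,
# yielding the human inter-rater range the model is compared against.
loo = [
    krippendorff.alpha(
        reliability_data=np.delete(pathologists, i, axis=0),
        level_of_measurement="ordinal",
    )
    for i in range(pathologists.shape[0])
]
print(f"model vs panel: {alpha_model:.2f}")
print(f"human range: mean {np.mean(loo):.2f}, SD {np.std(loo):.2f}")
```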

This study demonstrates the feasibility of using ABMIL models to automatically score histologic features in CD with reliability comparable to that of human experts. The model's interpretability and its agreement within the bounds of pathologist variation underscore its potential utility as a diagnostic aid or second reader, and support further exploration of hybrid workflows combining human expertise with machine assistance, particularly for grading rare hematologic diseases where inter-rater variability is a known challenge.
